16 research outputs found

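    Modèles de programmation des applications de traitement du signal et de l'image sur cluster parallèle et hétérogène (Programming models for signal and image processing applications on parallel and heterogeneous clusters)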

    Over the past decade, computing systems have evolved toward parallel and heterogeneous architectures. Clusters, composed of several nodes connected via a network and each including heterogeneous processing units, achieve high performance. To program these architectures, the user must rely on programming models such as MPI, OpenMP or CUDA. However, it remains difficult to reconcile performance with programmer productivity, which comes from abstracting away the specifics of the architecture. In this thesis, we exploit the idea that a programming model specific to a particular application domain can reconcile these two conflicting goals. Indeed, by characterizing a family of applications, it is possible to identify high-level abstractions that model them efficiently. We propose two models specific to the implementation of signal and image processing applications on heterogeneous clusters. The first model is static; we enrich it with a task migration feature. The second model is dynamic and based on the StarPU runtime. Both models offer a high level of abstraction by modeling signal and image processing applications as a data flow graph, and both efficiently exploit task, data and graph parallelism. We validate these models with several implementations and comparisons, including two real-world image processing applications on a CPU-GPU cluster.
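
    As an illustration of what modeling an application as a data flow graph can look like, here is a minimal sketch. It is not the thesis's model and does not use StarPU; the function `run_dataflow_graph` and the node names (`read`, `scale`, `offset`, `merge`) are hypothetical. Each node is a function, each edge carries a producer's output to a consumer, and a node runs once all of its inputs are available. The nodes `scale` and `offset` do not depend on each other, which is the kind of task parallelism a runtime could exploit.

```python
from collections import deque


def run_dataflow_graph(nodes, edges, source_inputs):
    """Run every node once, in an order that respects data dependencies.

    nodes: dict mapping node name -> function taking a list of input values
    edges: list of (producer, consumer) pairs
    source_inputs: dict mapping source node name -> list of initial inputs
    """
    preds = {name: [] for name in nodes}
    succs = {name: [] for name in nodes}
    for producer, consumer in edges:
        preds[consumer].append(producer)
        succs[producer].append(consumer)

    # A node becomes ready when all of its producers have finished.
    pending = {name: len(preds[name]) for name in nodes}
    ready = deque(name for name in nodes if pending[name] == 0)
    results = {}
    while ready:
        name = ready.popleft()
        if preds[name]:
            inputs = [results[p] for p in preds[name]]
        else:
            inputs = source_inputs.get(name, [])
        results[name] = nodes[name](inputs)
        for successor in succs[name]:
            pending[successor] -= 1
            if pending[successor] == 0:
                ready.append(successor)
    if len(results) != len(nodes):
        raise ValueError("graph contains a cycle")
    return results


# Hypothetical four-node pipeline on a tiny "signal".
nodes = {
    "read": lambda ins: ins[0],
    "scale": lambda ins: [2 * x for x in ins[0]],
    "offset": lambda ins: [x + 1 for x in ins[0]],
    "merge": lambda ins: [a + b for a, b in zip(ins[0], ins[1])],
}
edges = [("read", "scale"), ("read", "offset"),
         ("scale", "merge"), ("offset", "merge")]
source_inputs = {"read": [[1, 2, 3, 4]]}

results = run_dataflow_graph(nodes, edges, source_inputs)
print(results["merge"])  # [4, 7, 10, 13]
```

    This sketch runs the nodes sequentially; a real runtime would dispatch ready nodes to the available CPU and GPU units instead of executing them one after another.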

    SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters

    International audience. Biomedical imaging, digital communications, acoustic signal processing and many other digital signal processing (DSP) applications are increasingly present in today's digital world. They process growing volumes of data, represented with ever greater accuracy, using complex algorithms that must satisfy time constraints. Consequently, they require a great deal of computing power. To meet this need, parallel and heterogeneous architectures have become unavoidable for speeding up the processing; the best examples are supercomputers such as "Tianhe-2" and "Titan" from the TOP500 ranking. These architectures, with their multi-core nodes supported by many-core accelerators, respond well to this problem, but they remain hard to program for performance because of issues such as synchronization, memory management and hardware specifics. In the present work, we propose a high-level programming model for implementing digital signal processing applications easily and efficiently on heterogeneous clusters.

    Task migration of DSP application specified with a DFG and implemented with the BSP computing model on a CPU-GPU cluster

    International audience. Computer applications are becoming ever more demanding while also requiring real-time results. Heterogeneous clusters, with their computing power, are a good answer to this demand. However, during execution, a computing element of the cluster may fail or need maintenance, or the load may need to be rebalanced. In this paper, we propose a migration strategy for relocating the execution of a task to another computing element. In particular, we are interested in remapping the nodes of a Data Flow Graph (DFG), representing a Digital Signal Processing (DSP) application, onto heterogeneous (CPU-GPU) clusters while sustaining the flow of data and minimizing the temporal perturbation. We give a lower bound for the flow of data after the migration and validate it with the real-time construction of a visual saliency map from video input.

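    Le modèle de programmation ORWL pour la parallélisation d'une application de suivi vidéo HD sur architecture multi-coeurs (The ORWL programming model for parallelizing an HD video tracking application on a multi-core architecture)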

    Accepted for publication in Compas'16. National audience. Thanks to advances in image and video capture technologies, it is now possible to collect a large amount of information about the observed world. Indeed, HD or ultra HD image sensors can produce several million pixels. This allows video surveillance applications such as moving-object tracking to benefit from a larger quantity of data and so produce better results. In this context, multi-core architectures are a good computing solution. They offer large memory resources able to hold these data and to process them with their many cores. However, to optimize their performance, the developer must manage several programming steps. To make these architectures easier to program, one can use programming models that provide abstractions over these steps. In this study, we implement an HD video tracking application on a multi-core architecture using the task-based programming model ORWL. This model allows us to produce an efficient implementation that speeds up the processing while benefiting from a high level of abstraction.

    Automatic, Abstracted and Portable Topology-Aware Thread Placement

    International audience. Efficiently programming shared-memory machines is a difficult challenge because mapping application threads onto the memory hierarchy has a strong impact on performance. However, optimizing such thread placement is difficult: architectures are becoming increasingly complex, and application behavior changes with implementations and input parameters, e.g., problem size and number of threads. In this work, we propose a fully automatic, abstracted and portable affinity module. It produces and implements an optimized affinity strategy that combines knowledge about application characteristics and the platform topology. Implemented in the back-end of our runtime system (ORWL), our approach was used to enhance the performance and the scalability of several unmodified ORWL-coded applications: matrix multiplication, a 2D stencil (Livermore Kernel 23), and a real-world video tracking application. On two SMP machines with quite different hardware characteristics, our tests show spectacular performance improvements for these unmodified application codes due to a dramatic decrease in cache misses and pipeline stalls. A comparison to reference implementations using OpenMP confirms this performance gain of almost one order of magnitude.
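
    To make the idea of combining application characteristics with platform topology concrete, here is a minimal sketch of a greedy placement heuristic. It is an assumption for illustration only, not the algorithm of the ORWL affinity module, which the abstract does not detail. The function `place_tasks`, the task names and the communication volumes are all hypothetical. Pairs of tasks that exchange the most data are placed on cores that share a cache.

```python
def place_tasks(tasks, comm, cache_groups):
    """Greedy placement: heavily communicating task pairs share a cache group.

    tasks: list of task names
    comm: dict mapping (task_a, task_b) -> communication volume
    cache_groups: list of lists of core ids; cores in one list share a cache
    Returns a dict mapping task -> core id.
    """
    free = [list(cores) for cores in cache_groups]  # free cores per group
    group_of = {}   # task -> index of the cache group it was placed in
    placement = {}  # task -> core id

    def assign(task, group):
        placement[task] = free[group].pop(0)
        group_of[task] = group

    # Visit pairs from the heaviest to the lightest communication volume.
    for (a, b), _volume in sorted(comm.items(), key=lambda kv: -kv[1]):
        if a in placement and b in placement:
            continue
        if a not in placement and b not in placement:
            for group, cores in enumerate(free):
                if len(cores) >= 2:
                    assign(a, group)
                    assign(b, group)
                    break
        else:
            placed, other = (a, b) if a in placement else (b, a)
            group = group_of[placed]
            if free[group]:
                assign(other, group)

    # Any task still unplaced goes to the first free core.
    for task in tasks:
        if task not in placement:
            for group, cores in enumerate(free):
                if cores:
                    assign(task, group)
                    break
            else:
                raise ValueError("more tasks than cores")
    return placement


tasks = ["t0", "t1", "t2", "t3"]
comm = {("t0", "t2"): 100, ("t1", "t3"): 80, ("t0", "t1"): 5}
cache_groups = [[0, 1], [2, 3]]  # two caches, two cores each
placement = place_tasks(tasks, comm, cache_groups)
print(placement)  # {'t0': 0, 't2': 1, 't1': 2, 't3': 3}
```

    A production module would read the topology from the machine and then bind threads to the chosen cores; this sketch only computes the mapping.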

    A hierarchical model to manage hardware topology in MPI applications

    International audience. The MPI standard is a major contribution to the landscape of parallel programming. Since its inception in the mid-1990s, it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. However, despite recent evolutions, the MPI standard in its current state is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address hardware topology and data locality issues while improving application performance.

    Un modèle hiérarchique pour la gestion de la topologie dans les applications MPI (A hierarchical model for topology management in MPI applications)

    The MPI standard is a major contribution to the landscape of parallel programming. Since its inception in the mid-1990s, it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology, in particular the memory and network hierarchies, has become of paramount importance. However, despite recent evolutions, the MPI standard in its current state is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address hardware topology and data locality issues while improving application performance.

    Hardware topology management in MPI applications through hierarchical communicators

    International audience. The MPI standard is a major contribution to the landscape of parallel programming. Since its inception in the mid-1990s, it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. Providing abstract mechanisms to manipulate the hardware topology is equally fundamental. However, despite recent evolutions, the MPI standard in its current state is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard for building new MPI communicators that correspond to hardware hierarchy levels. These additions provide the user with tools to address hardware topology and locality issues while improving application performance.
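
    The notion of communicators that correspond to hardware hierarchy levels can be pictured with a small simulation. The sketch below does not call MPI and is not the interface proposed in the paper, which the abstract does not describe. The hierarchy `LEVELS`, the function `split_by_level` and the rank locations are hypothetical. Ranks that share the same hardware component at a given level end up in the same group; the grouping key plays a role similar to the `color` argument of `MPI_Comm_split`.

```python
# Hypothetical hardware hierarchy, outermost level first.
LEVELS = ("node", "socket", "core")


def split_by_level(rank_locations, level):
    """Group ranks that share the same hardware component at `level`.

    rank_locations: dict mapping rank -> (node id, socket id, core id)
    level: one of the names in LEVELS
    Returns a dict mapping a location prefix -> list of ranks in rank order.
    """
    depth = LEVELS.index(level) + 1
    groups = {}
    for rank in sorted(rank_locations):
        # The prefix identifies the component, e.g. (node, socket) for a socket.
        key = rank_locations[rank][:depth]
        groups.setdefault(key, []).append(rank)
    return groups


# Eight ranks on two nodes, each node with two sockets of two cores.
rank_locations = {
    0: (0, 0, 0), 1: (0, 0, 1), 2: (0, 1, 0), 3: (0, 1, 1),
    4: (1, 0, 0), 5: (1, 0, 1), 6: (1, 1, 0), 7: (1, 1, 1),
}

by_node = split_by_level(rank_locations, "node")
by_socket = split_by_level(rank_locations, "socket")
print(by_node)    # {(0,): [0, 1, 2, 3], (1,): [4, 5, 6, 7]}
print(by_socket)  # {(0, 0): [0, 1], (0, 1): [2, 3], (1, 0): [4, 5], (1, 1): [6, 7]}
```

    Each group stands in for one communicator at that level, so a collective restricted to a socket-level group would involve only ranks that share that socket.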

    Programming models for signal and image processing on parallel and heterogeneous architectures

    Over the past decade, computing systems have evolved toward parallel and heterogeneous architectures. Clusters, composed of several nodes connected via a network and each including heterogeneous processing units, achieve high performance. To program these architectures, the user must rely on programming models such as MPI, OpenMP or CUDA. However, it remains difficult to reconcile performance with programmer productivity, which comes from abstracting away the specifics of the architecture. In this thesis, we exploit the idea that a programming model specific to a particular application domain can reconcile these two conflicting goals. Indeed, by characterizing a family of applications, it is possible to identify high-level abstractions that model them efficiently. We propose two models specific to the implementation of signal and image processing applications on heterogeneous clusters. The first model is static; we enrich it with a task migration feature. The second model is dynamic and based on the StarPU runtime. Both models offer a high level of abstraction by modeling signal and image processing applications as a data flow graph, and both efficiently exploit task, data and graph parallelism. We validate these models with several implementations and comparisons, including two real-world image processing applications on a CPU-GPU cluster.

    Optimizing Locality by Topology-aware Placement for a Task Based Programming Model

    International audience. The ordered read-write lock model (ORWL) is a modern framework that provides high-level abstractions for decomposing an application and for managing synchronization and communication. The implementation of the model achieves high performance thanks to a decentralized event-based runtime. In this paper, we enrich ORWL with a topology-aware placement module based on the Hardware Locality framework, HWLOC. The aim is twofold: on the one hand, we increase the abstraction and the portability of the framework; on the other hand, we enhance the performance of the model's runtime. We propose a placement policy that takes into account the characteristics of the application, of the runtime and of the architecture. We validate and compare our approach using the Livermore Kernel 23 benchmark.